Abstract

Bayesian optimization is a powerful tool for fine-tuning the hyper-parameters of a wide variety of machine learning models. The success of machine learning has led practitioners in diverse real-world settings to learn classifiers for practical problems. As machine learning becomes commonplace, Bayesian optimization becomes an attractive method for practitioners to automate the process of classifier hyper-parameter tuning.

A key observation is that the data used for tuning models in these settings is often sensitive. Certain data such as genetic predisposition, personal email statistics, and car accident history, if not properly protected, may be at risk of being inferred from Bayesian optimization outputs. To address this, we introduce methods for releasing the best hyper-parameters and classifier accuracy privately. Leveraging the strong theoretical guarantees of differential privacy and known Bayesian optimization convergence bounds, we prove that under a GP assumption these private quantities are also near-optimal. Finally, even if this assumption is not satisfied, we can use different smoothness guarantees to protect privacy.

1. Introduction 

Machine learning is increasingly used in application areas with sensitive data. For example, hospitals use machine learning to predict if a patient is likely to be readmitted soon (Yu et al., 2013), webmail providers classify spam emails from non-spam (Weinberger et al., 2009), and insurance providers forecast the extent of bodily injury in car crashes (Chong et al., 2005).

In these scenarios data cannot be shared legally, but companies and hospitals may want to share hyper-parameters and validation accuracies through publications or other means. However, data-holders must be careful, as even a small amount of information can compromise privacy.

Which hyper-parameter setting yields the highest accuracy can reveal sensitive information about individuals in the validation or training data set, reminiscent of reconstruction attacks described by Dwork & Roth (2013) and Dinur & Nissim (2003). For example, imagine updated hyper-parameters are released right after a prominent public figure is admitted to a hospital. If a hyper-parameter is known to correlate strongly with a particular disease the patient is suspected to have, an attacker could make a direct correlation between the hyper-parameter value and the individual.

To prevent this sort of attack, we develop a set of algorithms that automatically fine-tune the hyper-parameters of a machine learning algorithm while provably preserving differential privacy (Dwork et al., 2006b). Our approach leverages recent results on Bayesian optimization (Snoek et al., 2012; Hutter et al., 2011; Bergstra & Bengio, 2012; Gardner et al., 2014), training a Gaussian process (GP) (Rasmussen & Williams, 2006) to accurately predict and maximize the validation gain of hyper-parameter settings. We show that the GP model in Bayesian optimization allows us to release noisy final hyper-parameter settings to protect against the aforementioned privacy attacks, while only sacrificing a tiny, bounded amount of validation gain.

Our privacy guarantees hold for releasing the best hyper-parameters and best validation gain. Specifically, our contributions are as follows:

• We derive, to the best of our knowledge, the first framework for Bayesian optimization with provable differential privacy guarantees,

• We develop variations both with and without observation noise, and

• We show that even if our validation gain is not drawn from a Gaussian process, we can guarantee differential privacy under different smoothness assumptions.

We begin with background on Bayesian optimization and differential privacy that we will use to prove our guarantees.

2. Background 

In general, our aim will be to protect the privacy of a validation dataset of sensitive records $\mathcal{V} \subseteq \mathcal{X}$ (where $\mathcal{X}$ is the collection of all possible records) when the results of Bayesian optimization depend on $\mathcal{V}$.

Bayesian optimization. Our goal is to maximize an unknown function $f_{\mathcal{V}} : \Lambda \to \mathbb{R}$ that depends on some validation dataset $\mathcal{V} \subseteq \mathcal{X}$:

$$\max_{\lambda \in \Lambda} f_{\mathcal{V}}(\lambda). \qquad (1)$$

It is important to point out that all of our results hold for the general setting of eq. (1), but throughout the paper, we use the vocabulary of a common application: that of machine learning hyper-parameter tuning. In this case $f_{\mathcal{V}}(\lambda)$ is the gain of a learning algorithm evaluated on validation dataset $\mathcal{V}$ that was trained with hyper-parameters $\lambda \in \Lambda \subseteq \mathbb{R}^d$.

As evaluating $f_{\mathcal{V}}$ is expensive (e.g., each evaluation requires training a learning algorithm), Bayesian optimization gives a procedure for selecting a small number of locations to sample $f_{\mathcal{V}}$: $[\lambda_1, \ldots, \lambda_T] = \boldsymbol{\lambda}_T \in \mathbb{R}^{d \times T}$. Specifically, given a current sample $\lambda_t$, we observe a validation gain $v_t$ such that $v_t = f_{\mathcal{V}}(\lambda_t) + \alpha_t$, where $\alpha_t \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise with possibly non-zero variance $\sigma^2$. Then, given $v_t$ and previously observed values $v_1, \ldots, v_{t-1}$, Bayesian optimization updates its belief of $f_{\mathcal{V}}$ and samples a new hyper-parameter $\lambda_{t+1}$. Each step of the optimization proceeds in this way.

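To decide which hyper-parameter to sample next, Bayesian optimization places a prior distribution over $f_{\mathcal{V}}$ and updates it after every (possibly noisy) function observation. One popular prior distribution over functions is the Gaussian process $\mathcal{GP}(\mu(\cdot), k(\cdot,\cdot))$ (Rasmussen & Williams, 2006), parameterized by a mean function $\mu(\cdot)$ (we set $\mu = 0$, w.l.o.g.) and a kernel covariance function $k(\cdot,\cdot)$. Functions drawn from a Gaussian process have the property that any finite set of values of the function are normally distributed. Additionally, given samples $\boldsymbol{\lambda}_T = [\lambda_1, \ldots, \lambda_T]$ and observations $\mathbf{v}_T = [v_1, \ldots, v_T]$, the GP posterior mean and variance have a closed form:

$$\mu_T(\lambda) = k(\lambda, \boldsymbol{\lambda}_T)(K_T + \sigma^2 I)^{-1} \mathbf{v}_T$$
$$k_T(\lambda, \lambda') = k(\lambda, \lambda') - k(\lambda, \boldsymbol{\lambda}_T)(K_T + \sigma^2 I)^{-1} k(\boldsymbol{\lambda}_T, \lambda')$$
$$\sigma_T^2(\lambda) = k_T(\lambda, \lambda), \qquad (2)$$

where $k(\lambda, \boldsymbol{\lambda}_T)$ is evaluated element-wise on each of the $T$ columns of $\boldsymbol{\lambda}_T$. As well, $K_T = k(\boldsymbol{\lambda}_T, \boldsymbol{\lambda}_T) \in \mathbb{R}^{T \times T}$ and $\lambda \in \Lambda$ is any hyper-parameter. As more samples are observed, the posterior mean function $\mu_T(\lambda)$ approaches $f_{\mathcal{V}}(\lambda)$.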
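As a rough illustration of the closed-form posterior in eq. (2), the following sketch computes the posterior mean and variance for scalar hyper-parameters using only the Python standard library. The function names (`sq_exp`, `solve`, `gp_posterior`), the choice of a squared-exponential kernel, and the use of plain Gaussian elimination are assumptions made for this example, not part of the paper.

```python
import math

def sq_exp(a, b, ell=1.0):
    """Squared-exponential kernel on scalars (normalized: k(a, a) = 1)."""
    return math.exp(-(a - b) ** 2 / (2.0 * ell ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gp_posterior(lam, lams, vs, noise_var, kernel=sq_exp):
    """Posterior mean and variance at `lam`, following eq. (2)."""
    T = len(lams)
    # K_T + sigma^2 I
    K = [[kernel(lams[i], lams[j]) + (noise_var if i == j else 0.0)
          for j in range(T)] for i in range(T)]
    k_star = [kernel(lam, x) for x in lams]
    alpha = solve(K, vs)       # (K_T + sigma^2 I)^{-1} v_T
    beta = solve(K, k_star)    # (K_T + sigma^2 I)^{-1} k(lambda_T, lambda)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = kernel(lam, lam) - sum(k * b for k, b in zip(k_star, beta))
    return mean, var
```

With a single observation $v_1 = 1$ at $\lambda_1 = 0$ and $\sigma^2 = 1$, the posterior at $\lambda = 0$ has mean $1/2$ and variance $1/2$, which matches eq. (2) by hand.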

One well-known method to select hyper-parameters $\lambda$ maximizes the upper-confidence bound (UCB) of the posterior GP model of $f_{\mathcal{V}}$ (Auer et al., 2002; Srinivas et al., 2010):

$$\lambda_{t+1} = \arg\max_{\lambda \in \Lambda} \mu_t(\lambda) + \sqrt{\beta_{t+1}}\, \sigma_t(\lambda), \qquad (3)$$

where $\beta_{t+1}$ is a parameter that trades off the exploitation of maximizing $\mu_t(\lambda)$ and the exploration of maximizing $\sigma_t(\lambda)$. Srinivas et al. (2010) proved that given certain assumptions on $f_{\mathcal{V}}$ and fixed, non-zero observation noise ($\sigma^2 > 0$), selecting hyper-parameters $\lambda$ to maximize eq. (3) is a no-regret Bayesian optimization procedure: $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} f_{\mathcal{V}}(\lambda^*) - f_{\mathcal{V}}(\lambda_t) = 0$, where $f_{\mathcal{V}}(\lambda^*)$ is the maximum of eq. (1). For the no-noise setting, de Freitas et al. (2012) give a UCB-based no-regret algorithm.
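The selection rule in eq. (3) can be sketched over a finite candidate set as follows. This is a minimal sketch, assuming the posterior means and standard deviations have already been computed for each candidate; the function name `ucb_select` and the list-based interface are hypothetical.

```python
import math

def ucb_select(means, stds, beta):
    """Index of the candidate maximizing mu(lambda) + sqrt(beta) * sigma(lambda)."""
    scores = [m + math.sqrt(beta) * s for m, s in zip(means, stds)]
    return max(range(len(scores)), key=scores.__getitem__)
```

A larger `beta` weights the posterior standard deviation more heavily, so uncertain candidates are preferred; with `beta = 0` the rule reduces to picking the highest posterior mean.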

Contributions. Alongside maximizing $f_{\mathcal{V}}$, we would like to guarantee that if $f_{\mathcal{V}}$ depends on (sensitive) validation data, we can release information about $f_{\mathcal{V}}$ so that the data $\mathcal{V}$ remains private. Specifically, we may wish to release (a) our best guess $\hat{\lambda} = \arg\max_{t \leq T} f_{\mathcal{V}}(\lambda_t)$ of the true (unknown) maximizer $\lambda^*$ and (b) our best guess $f_{\mathcal{V}}(\hat{\lambda})$ of the true (also unknown) maximum objective $f_{\mathcal{V}}(\lambda^*)$. The primary question this work aims to answer is: how can we release private versions of $\hat{\lambda}$ and $f_{\mathcal{V}}(\hat{\lambda})$ that are close to their true values, or better, the values $\lambda^*$ and $f_{\mathcal{V}}(\lambda^*)$? We give two answers to this question. The first will make a Gaussian process assumption on $f_{\mathcal{V}}$, which we describe immediately below. The second, described in Section 5, will utilize Lipschitz and convexity assumptions to guarantee privacy in the event the GP assumption does not hold.

Setting. For our first answer to this question, let us define a Gaussian process over hyper-parameters $\lambda, \lambda' \in \Lambda$ and datasets $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$ as follows: $\mathcal{GP}(0, k_1(\mathcal{V}, \mathcal{V}') \otimes k_2(\lambda, \lambda'))$. A prior of this form is known as a multi-task Gaussian process (Bonilla et al., 2008). Many choices for $k_1$ and $k_2$ are possible. The function $k_1(\mathcal{V}, \mathcal{V}')$ defines a set kernel (e.g., a function of the number of records that differ between $\mathcal{V}$ and $\mathcal{V}'$). For $k_2$, we focus on either the squared exponential kernel, $k_2(\lambda, \lambda') = \exp(-\|\lambda - \lambda'\|_2^2 / (2\ell^2))$, or Matérn kernels (e.g., for $\nu = 5/2$, $k_2(\lambda, \lambda') = (1 + \sqrt{5}r/\ell + (5r^2)/(3\ell^2)) \exp(-\sqrt{5}r/\ell)$, for $r = \|\lambda - \lambda'\|_2$), for a fixed $\ell$, as they have known bounds on the maximum information gain (Srinivas et al., 2010). Note that as defined, the kernel $k_2$ is normalized (i.e., $k_2(\lambda, \lambda) = 1$).
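The two kernels above can be written out directly; the sketch below does so for hyper-parameters represented as Python lists of floats. The function names and the default length scale `ell=1.0` are assumptions made for illustration.

```python
import math

def l2_dist(a, b):
    """Euclidean distance r = ||a - b||_2 between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_exponential(a, b, ell=1.0):
    """k2(a, b) = exp(-||a - b||^2 / (2 ell^2))."""
    r = l2_dist(a, b)
    return math.exp(-r ** 2 / (2.0 * ell ** 2))

def matern52(a, b, ell=1.0):
    """Matern kernel with nu = 5/2."""
    r = l2_dist(a, b)
    s = math.sqrt(5.0) * r / ell
    return (1.0 + s + (5.0 * r ** 2) / (3.0 * ell ** 2)) * math.exp(-s)
```

Both functions return 1 when the two arguments coincide, which is the normalization $k_2(\lambda, \lambda) = 1$ noted in the text.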

Assumption 1. We have a problem of type (1), where all possible dataset functions $[f_1, \ldots, f_{2^{|\mathcal{X}|}}]$ are GP distributed $\mathcal{GP}(0, k_1(\mathcal{V}, \mathcal{V}') \otimes k_2(\lambda, \lambda'))$ for known kernels $k_1, k_2$, for all $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$ and $\lambda, \lambda' \in \Lambda$, where $|\Lambda| < \infty$.

Similar Gaussian process assumptions have been made in previous work (Srinivas et al., 2010). For a result in the no-noise observation setting, we will make use of the assumptions of de Freitas et al. (2012) for our privacy guarantees, as described in Section 4.

2.1. Differential Privacy 

One of the most widely accepted frameworks for private data release is differential privacy (Dwork et al., 2006b), which has been shown to be robust to a variety of privacy attacks (Ganta et al., 2008; Sweeney, 1997; Narayanan & Shmatikov, 2008). Given an algorithm $\mathcal{A}$ that outputs a value $\lambda$ when run on dataset $\mathcal{V}$, the goal of differential privacy is to 'hide' the effect of a small change in $\mathcal{V}$ on the output of $\mathcal{A}$. Equivalently, an attacker should not be able to tell if a private record was swapped in $\mathcal{V}$ just by looking at the output of $\mathcal{A}$. If two datasets $\mathcal{V}, \mathcal{V}'$ differ by swapping a single element, we will refer to them as neighboring datasets. Note that any non-trivial algorithm (i.e., an algorithm $\mathcal{A}$ that outputs different values on $\mathcal{V}$ and $\mathcal{V}'$ for some pair $\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}$) must include some amount of randomness to guarantee such a change in $\mathcal{V}$ is unobservable in the output $\lambda$ of $\mathcal{A}$ (Dwork & Roth, 2013). The level of privacy we wish to guarantee decides the amount of randomness we need to add to $\lambda$ (better privacy requires increased randomness). Formally, the definition of differential privacy is stated below.

Definition 1. A randomized algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private for $\epsilon, \delta \geq 0$ if for all $\lambda \in \mathrm{Range}(\mathcal{A})$ and for all neighboring datasets $\mathcal{V}, \mathcal{V}'$ (i.e., such that $\mathcal{V}$ and $\mathcal{V}'$ differ by swapping one record) we have that

$$\Pr[\mathcal{A}(\mathcal{V}) = \lambda] \leq e^{\epsilon} \Pr[\mathcal{A}(\mathcal{V}') = \lambda] + \delta. \qquad (4)$$

The parameters $\epsilon, \delta$ guarantee how private $\mathcal{A}$ is; the smaller, the more private. The maximum privacy is $\epsilon = \delta = 0$, in which case eq. (4) holds with equality. This can be seen by the fact that $\mathcal{V}$ and $\mathcal{V}'$ can be swapped in the definition, and thus the inequality holds in both directions. If $\delta = 0$, we say the algorithm is simply $\epsilon$-differentially private. For a survey on differential privacy we refer the interested reader to Dwork & Roth (2013).

There are two popular methods for making an algorithm $\epsilon$-differentially private: (a) the Laplace mechanism (Dwork et al., 2006b), in which we add random noise to $\lambda$ and (b) the exponential mechanism (McSherry & Talwar, 2007), which draws a random output $\tilde{\lambda}$ such that $\tilde{\lambda} \approx \lambda$. For each mechanism we must define an intermediate quantity called the global sensitivity describing how much $\mathcal{A}$ changes when $\mathcal{V}$ changes.

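Definition 2. (Laplace mechanism) The global sensitivity of an algorithm $\mathcal{A}$ over all neighboring datasets $\mathcal{V}, \mathcal{V}'$ (i.e., $\mathcal{V}, \mathcal{V}'$ differ by swapping one record) is

$$\Delta_{\mathcal{A}} = \max_{\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X}} \|\mathcal{A}(\mathcal{V}) - \mathcal{A}(\mathcal{V}')\|_1.$$

(Exponential mechanism) The global sensitivity of a function $q : \mathcal{X} \times \Lambda \to \mathbb{R}$ over all neighboring datasets $\mathcal{V}, \mathcal{V}'$ is

$$\Delta_q = \max_{\substack{\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X} \\ \lambda \in \Lambda}} \|q(\mathcal{V}, \lambda) - q(\mathcal{V}', \lambda)\|_1.$$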

The Laplace mechanism hides the output of $\mathcal{A}$ by perturbing its output with some amount of random noise.

Definition 3. Given a dataset $\mathcal{V}$ and an algorithm $\mathcal{A}$, the Laplace mechanism returns $\mathcal{A}(\mathcal{V}) + \omega$, where $\omega$ is a noise variable drawn from $\mathrm{Lap}(\Delta_{\mathcal{A}}/\epsilon)$, the Laplace distribution with scale parameter $\Delta_{\mathcal{A}}/\epsilon$ (and location parameter 0).

The exponential mechanism draws a slightly different $\tilde{\lambda}$ that is 'close' to $\lambda$, the output of $\mathcal{A}$.

Definition 4. Given a dataset $\mathcal{V}$ and an algorithm $\mathcal{A}(\mathcal{V}) = \arg\max_{\lambda \in \Lambda} q(\mathcal{V}, \lambda)$, the exponential mechanism returns $\tilde{\lambda}$, where $\tilde{\lambda}$ is drawn from the distribution $\frac{1}{Z} \exp(\epsilon q(\mathcal{V}, \lambda) / (2\Delta_q))$, and $Z$ is a normalizing constant.
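Definitions 3 and 4 translate into short sampling routines. The sketch below is illustrative only: the function names and the `rng` parameter are assumptions, the sensitivity is taken as a given input, and Laplace noise is generated as the difference of two exponential draws because the standard library has no direct Laplace sampler.

```python
import math
import random

def laplace_mechanism(value, sensitivity, eps, rng=random):
    """Return value + Lap(sensitivity / eps) noise (Definition 3)."""
    scale = sensitivity / eps
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return value + noise

def exponential_mechanism(candidates, scores, sensitivity, eps, rng=random):
    """Draw a candidate w.p. proportional to exp(eps * q / (2 * sensitivity))."""
    m = max(scores)  # shifting by the max cancels in the normalization Z
    weights = [math.exp(eps * (s - m) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes draws reproducible for testing; in an actual private release the noise should of course not be reproducible by an attacker.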

Given $\Lambda$, a possible set of hyper-parameters, we derive methods for privately releasing the best hyper-parameters and the best function values $f_{\mathcal{V}}$, approximately solving eq. (1). We first address the setting with observation noise ($\sigma^2 > 0$) in eq. (2) and then describe small modifications for the no-noise setting. For each setting we use the UCB sampling technique in eq. (3) to derive our private results.

3. With observation noise 

In general cases of Bayesian optimization, observation noise occurs in a variety of real-world modeling settings such as sensor measurement prediction (Krause et al., 2008). In hyper-parameter tuning, noise in the validation gain may be a result of noisy validation or training features.

In the sections that follow, although the quantities $f, \mu, \sigma, v$ all depend on the validation dataset $\mathcal{V}$, for notational simplicity we will occasionally omit the subscript $\mathcal{V}$. Similarly, for $\mathcal{V}'$ we will often write: $f', \mu', \sigma', v'$.

3.1. Private near-maximum hyper-parameters 

In this section we guarantee that releasing $\tilde{\lambda}$ in Algorithm 1 is private (Theorem 1) and that it is near-optimal (Theorem 2). Our proof strategy is as follows: we will first bound the global sensitivity of $\mu_T(\lambda)$ with probability at least $1 - \delta$. Then we will show that releasing $\tilde{\lambda}$ via the exponential mechanism is $(\epsilon, \delta)$-differentially private. Finally, we prove that $\mu_T(\tilde{\lambda})$ is close to $f(\lambda^*)$, the true maximum of eq. (1).
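Algorithm 1 Private Bayesian Opt. (noisy observations)

Input: $\mathcal{V}$; $\Lambda \subseteq \mathbb{R}^d$; $T$; $(\epsilon, \delta)$; $\sigma^2_{\mathcal{V},0}$; $\gamma_T$
$\mu_{\mathcal{V},0} = 0$
for $t = 1 \ldots T$ do
    $\beta_t = 2\log(|\Lambda| t^2 \pi^2 / (3\delta))$
    $\lambda_t = \arg\max_{\lambda \in \Lambda} \mu_{\mathcal{V},t-1}(\lambda) + \sqrt{\beta_t}\, \sigma_{\mathcal{V},t-1}(\lambda)$
    Observe validation gain $v_{\mathcal{V},t}$, given $\lambda_t$
    Update $\mu_{\mathcal{V},t}$ and $\sigma^2_{\mathcal{V},t}$ according to (2)
end for
$c = 2\sqrt{(1 - k(\mathcal{V}, \mathcal{V}')) \log(3|\Lambda|/\delta)}$
$q = \sigma\sqrt{4\log(3/\delta)}$
$C_1 = 8/\log(1 + \sigma^{-2})$
Draw $\tilde{\lambda} \in \Lambda$ w.p. $\Pr[\lambda] \propto \exp\left(\frac{\epsilon\, \mu_{\mathcal{V},T}(\lambda)}{2(2\sqrt{\beta_{T+1}} + c)}\right)$
$v^* = \max_{t \leq T} v_{\mathcal{V},t}$
Draw $\theta \sim \mathrm{Lap}\left[\frac{\sqrt{C_1 \beta_T \gamma_T}}{\epsilon\sqrt{T}} + \frac{c}{\epsilon} + \frac{q}{\epsilon}\right]$
$\tilde{v} = v^* + \theta$
Return: $\tilde{\lambda}, \tilde{v}$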
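To make the constants in the release step of Algorithm 1 concrete, the following sketch evaluates them numerically. It assumes that the exponential-mechanism sensitivity is $2\sqrt{\beta_{T+1}} + c$ (as in Theorem 2) and that the Laplace scale is the Theorem 3 sensitivity bound divided by $\epsilon$; the function name, argument names, and dictionary keys are hypothetical, and `gamma_T` must be supplied by the caller since it depends on the kernel.

```python
import math

def algorithm1_release_params(k_vv, n_lambda, T, eps, delta, sigma, gamma_T):
    """Constants used when releasing lambda-tilde and v-tilde (illustrative)."""
    def beta(t):
        # beta_t = 2 log(|Lambda| t^2 pi^2 / (3 delta))
        return 2.0 * math.log(n_lambda * t ** 2 * math.pi ** 2 / (3.0 * delta))

    c = 2.0 * math.sqrt((1.0 - k_vv) * math.log(3.0 * n_lambda / delta))
    q = sigma * math.sqrt(4.0 * math.log(3.0 / delta))
    C1 = 8.0 / math.log(1.0 + sigma ** -2)
    # Assumed sensitivity of mu_T for the exponential mechanism.
    exp_sensitivity = 2.0 * math.sqrt(beta(T + 1)) + c
    # Assumed Laplace scale: (sqrt(C1 beta_T gamma_T / T) + c + q) / eps.
    laplace_scale = (math.sqrt(C1 * beta(T) * gamma_T / T) + c + q) / eps
    return {"c": c, "q": q, "C1": C1,
            "exp_sensitivity": exp_sensitivity,
            "laplace_scale": laplace_scale}
```

Note that `c` vanishes when `k_vv = 1`, i.e., when the set kernel treats the two neighboring datasets as perfectly correlated, which is the regime in which the least noise is required.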


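Global sensitivity. As a first step we bound the global sensitivity of $\mu_T(\lambda)$ as follows:

Theorem 1. Given Assumption 1, for any two neighboring datasets $\mathcal{V}, \mathcal{V}'$ and for all $\lambda \in \Lambda$, with probability at least $1 - \delta$ there is an upper bound on the global sensitivity (in the exponential mechanism sense) of $\mu_T$:

$$|\mu'_T(\lambda) - \mu_T(\lambda)| \leq 2\sqrt{\beta_{T+1}} + \sigma_1\sqrt{2\log(3|\Lambda|/\delta)},$$

for $\sigma_1 = \sqrt{2(1 - k_1(\mathcal{V}, \mathcal{V}'))}$, $\beta_t = 2\log(|\Lambda| t^2 \pi^2 / (3\delta))$.

Proof. Note that, by applying the triangle inequality twice, for all $\lambda \in \Lambda$,

$$|\mu'_T(\lambda) - \mu_T(\lambda)| \leq |\mu'_T(\lambda) - f'(\lambda)| + |f'(\lambda) - \mu_T(\lambda)|$$
$$\leq |\mu'_T(\lambda) - f'(\lambda)| + |f'(\lambda) - f(\lambda)| + |f(\lambda) - \mu_T(\lambda)|.$$

We can now bound each one of the terms in the summation on the right hand side (RHS) with probability at least $1 - \frac{\delta}{3}$. According to Srinivas et al. (2010), Lemma 5.1, we obtain $|\mu'_T(\lambda) - f'(\lambda)| \leq \sqrt{\beta_{T+1}}\, \sigma'_T(\lambda)$. The same can be applied to $|f(\lambda) - \mu_T(\lambda)|$. As $\sigma'_T(\lambda) \leq 1$, because $k(\lambda, \lambda) = 1$, we can upper bound the sum of both terms by $2\sqrt{\beta_{T+1}}$. In order to bound the remaining (middle) term on the RHS recall that for a random variable $Z \sim \mathcal{N}(0, 1)$ we have: $\Pr[|Z| > \gamma] \leq e^{-\gamma^2/2}$. For variables $Z_1, \ldots, Z_n \sim \mathcal{N}(0, 1)$, we have, by the union bound, that $\Pr[\forall i, |Z_i| \leq \gamma] \geq 1 - n e^{-\gamma^2/2} = 1 - \frac{\delta}{3}$. If we set $Z = \frac{f(\lambda) - f'(\lambda)}{\sigma_1}$ and $n = |\Lambda|$, we obtain $\gamma = \sqrt{2\log(3|\Lambda|/\delta)}$, which completes the proof. ■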

We remark that all of the quantities in Theorem 1 are either given or selected by the modeler (e.g., $\delta$, $T$). Given this upper bound we can apply the exponential mechanism to release $\tilde{\lambda}$ privately, as per Definition 1:

Corollary 1. Let $\mathcal{A}(\mathcal{V})$ denote Algorithm 1 applied on dataset $\mathcal{V}$. Given Assumption 1, $\tilde{\lambda}$ is $(\epsilon, \delta)$-differentially private, i.e., $\Pr[\mathcal{A}(\mathcal{V}) = \tilde{\lambda}] \leq e^{\epsilon} \Pr[\mathcal{A}(\mathcal{V}') = \tilde{\lambda}] + \delta$, for any pair of neighboring datasets $\mathcal{V}, \mathcal{V}'$.

We leave the proof of Corollary 1 to the supplementary material. Even though we must release a noisy hyper-parameter setting $\tilde{\lambda}$, it is in fact near-optimal.

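Theorem 2. Given Assumption 1 the following near-optimal approximation guarantee for releasing $\tilde{\lambda}$ holds:

$$\mu_T(\tilde{\lambda}) \geq f(\lambda^*) - 2\sqrt{\beta_T} - q - \frac{2\Delta}{\epsilon}(\log|\Lambda| + a)$$

w.p. $\geq 1 - (\delta + e^{-a})$, where $\Delta = 2\sqrt{\beta_{T+1}} + c$ (for $\beta_{T+1}$, $c$, and $q$ defined as in Algorithm 1).

Proof. In general, the exponential mechanism selects $\tilde{\lambda}$ that is close to the maximum $\lambda$ (McSherry & Talwar, 2007):

$$\mu_T(\tilde{\lambda}) \geq \max_{\lambda \in \Lambda} \mu_T(\lambda) - \frac{2\Delta}{\epsilon}(\log|\Lambda| + a), \qquad (5)$$

with probability at least $1 - e^{-a}$. Recall we assume that at each optimization step we observe noisy gain $v_t = f(\lambda_t) + \alpha_t$, where $\alpha_t \sim \mathcal{N}(0, \sigma^2)$ (with fixed noise variance $\sigma^2 > 0$). As such, we can lower bound the term $\max_{\lambda \in \Lambda} \mu_T(\lambda)$:

$$\max_{\lambda \in \Lambda} \mu_T(\lambda) \geq \underbrace{f(\lambda_T) + \alpha_T}_{v_T}$$
$$f(\lambda^*) - \max_{\lambda \in \Lambda} \mu_T(\lambda) \leq f(\lambda^*) - f(\lambda_T) - \alpha_T$$
$$f(\lambda^*) - \max_{\lambda \in \Lambda} \mu_T(\lambda) \leq 2\sqrt{\beta_T}\, \sigma_{T-1}(\lambda_T) - \alpha_T$$
$$\max_{\lambda \in \Lambda} \mu_T(\lambda) \geq f(\lambda^*) - 2\sqrt{\beta_T} + \alpha_T, \qquad (6)$$

where the third line follows from Srinivas et al. (2010), Lemma 5.2 and the fourth line from the fact that $\sigma_{T-1}(\lambda_T) \leq 1$.

As in the proof of Theorem 1, given a normal random variable $Z \sim \mathcal{N}(0, 1)$ we have: $\Pr[|Z| \leq \gamma] \geq 1 - e^{-\gamma^2/2} = 1 - \frac{\delta}{2}$. Therefore if we set $Z = \frac{\alpha_T}{\sigma}$ we have $\gamma = \sqrt{2\log(2/\delta)}$. This implies that $|\alpha_T| \leq \sigma\sqrt{2\log(2/\delta)} \leq \sigma\sqrt{4\log(3/\delta)} = q$ (as defined in Algorithm 1) with probability at least $1 - \frac{\delta}{2}$. Thus, we can lower bound $\alpha_T$ by $-q$. We can then lower bound $\max_{\lambda \in \Lambda} \mu_T(\lambda)$ in eq. (5) with the right hand side of eq. (6). Therefore, given the $\beta_T$ in Algorithm 1, Srinivas et al. (2010), Lemma 5.2 holds with probability at least $1 - \frac{\delta}{2}$ and the theorem statement follows. ■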

3.2. Private near-maximum validation gain 

In this section we demonstrate that releasing the validation gain $\tilde{v}$ in Algorithm 1 is private (Theorem 3) and that the noise we add to ensure privacy is bounded with high probability (Theorem 4). As in the previous section our approach will be to first derive the global sensitivity of the maximum $v$ found by Algorithm 1. Then we show releasing $\tilde{v}$ is $(\epsilon, \delta)$-differentially private via the Laplace mechanism. Perhaps surprisingly, we also show that $\tilde{v}$ is close to $f(\lambda^*)$.

Global sensitivity. We bound the global sensitivity of the maximum $v$ found with Bayesian optimization and UCB:

Theorem 3. Given Assumption 1, and neighboring $\mathcal{V}, \mathcal{V}'$, we have the following global sensitivity bound (in the Laplace mechanism sense) for the maximum $v$, w.p. $\geq 1 - \delta$:

$$\left|\max_{t \leq T} v'_t - \max_{t \leq T} v_t\right| \leq \frac{\sqrt{C_1 \beta_T \gamma_T}}{\sqrt{T}} + c + q,$$

where the maximum Gaussian process information gain $\gamma_T$ is bounded above for the squared exponential and Matérn kernels (Srinivas et al., 2010).

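Proof. For notational simplicity let us denote the regret term as $\Omega = \sqrt{C_1 T \beta_T \gamma_T}$. Then from Theorem 1 in Srinivas et al. (2010) we have that

$$\frac{\Omega}{T} \geq f(\lambda^*) - \frac{1}{T}\sum_{t=1}^{T} f(\lambda_t) \geq f(\lambda^*) - \max_{t \leq T} f(\lambda_t). \qquad (7)$$

This implies $f(\lambda^*) \leq \max_{t \leq T} f(\lambda_t) + \frac{\Omega}{T}$ with probability at least $1 - \frac{\delta}{3}$ (with appropriate choice of $\beta_T$).

Recall that in the proof of Theorem 1 we showed that $|f(\lambda) - f'(\lambda)| \leq c$ with probability at least $1 - \frac{\delta}{3}$ (for $c$ given in Algorithm 1). This along with the above expression imply the following two sets of inequalities with probability greater than $1 - \frac{2\delta}{3}$:

$$f'(\lambda^*) - c \leq f(\lambda^*) \leq \max_{t \leq T} f(\lambda_t) + \frac{\Omega}{T};$$
$$f(\lambda^*) - c \leq f'(\lambda^*) \leq \max_{t \leq T} f'(\lambda_t) + \frac{\Omega}{T}.$$

These, in turn, imply the two sets of inequalities:

$$\max_{t \leq T} f'(\lambda_t) \leq f'(\lambda^*) \leq \max_{t \leq T} f(\lambda_t) + \frac{\Omega}{T} + c;$$
$$\max_{t \leq T} f(\lambda_t) \leq f(\lambda^*) \leq \max_{t \leq T} f'(\lambda_t) + \frac{\Omega}{T} + c.$$

This implies $|\max_{t \leq T} f'(\lambda_t) - \max_{t \leq T} f(\lambda_t)| \leq \frac{\Omega}{T} + c$. That is, the global sensitivity of $\max_{t \leq T} f(\lambda_t)$ is bounded. Given the sensitivity of the maximum $f$, we can readily derive the sensitivity of the maximum $v$. First note that we can use the triangle inequality to derive

$$\left|\max_{t \leq T} v'(\lambda_t) - \max_{t \leq T} v(\lambda_t)\right| \leq \left|\max_{t \leq T} v(\lambda_t) - \max_{t \leq T} f(\lambda_t)\right| + \left|\max_{t \leq T} v'(\lambda_t) - \max_{t \leq T} f'(\lambda_t)\right| + \left|\max_{t \leq T} f'(\lambda_t) - \max_{t \leq T} f(\lambda_t)\right|.$$

We can immediately bound the final term on the right hand side. Note that as $v_t = f(\lambda_t) + \alpha_t$, the first two terms are bounded above by $|\alpha|$ and $|\alpha'|$, where $\alpha = \{\alpha_t \mid t = \arg\max_{t \leq T} |\alpha_t|\}$ (similarly for $\alpha'$). This is because, in the worst case, the observation noise shifts the observed maximum $\max_{t \leq T} v_t$ up or down by $\alpha$. Therefore, let $\hat{\alpha} = \alpha$ if $|\alpha| > |\alpha'|$ and $\hat{\alpha} = \alpha'$ otherwise, so that we have:

$$\left|\max_{t \leq T} v'(\lambda_t) - \max_{t \leq T} v(\lambda_t)\right| \leq \frac{\Omega}{T} + c + |2\hat{\alpha}|.$$

Although $|\hat{\alpha}|$ can be arbitrarily large, recall that for $Z \sim \mathcal{N}(0, 1)$ we have: $\Pr[|Z| \leq \gamma] \geq 1 - e^{-\gamma^2/2} = 1 - \frac{\delta}{3}$. Therefore if we set $Z = \frac{2\hat{\alpha}}{\sqrt{2}\sigma}$ we have $\gamma = \sqrt{2\log(3/\delta)}$. This implies that $|2\hat{\alpha}| \leq \sigma\sqrt{4\log(3/\delta)} = q$ with probability at least $1 - \frac{\delta}{3}$. Therefore, if Theorem 1 from Srinivas et al. (2010) and the bound on $|f(\lambda) - f'(\lambda)|$ hold together with probability at least $1 - \frac{2\delta}{3}$ as described above, the theorem follows directly. ■

As in Theorem 1 each quantity in the above bound is given in Algorithm 1 ($\beta$, $c$, $q$), given in previous results (Srinivas et al., 2010) ($\gamma_T$, $C_1$) or specified by the modeler ($T$, $\delta$). Now that we have a bound on the sensitivity of the maximum $v$ we will use the Laplace mechanism to prove our privacy guarantee (proof in supplementary material):

Corollary 2. Let $\mathcal{A}(\mathcal{V})$ denote Algorithm 1 run on dataset $\mathcal{V}$. Given Assumption 1, releasing $\tilde{v}$ is $(\epsilon, \delta)$-differentially private, i.e., $\Pr[\mathcal{A}(\mathcal{V}) = \tilde{v}] \leq e^{\epsilon} \Pr[\mathcal{A}(\mathcal{V}') = \tilde{v}] + \delta$.

Further, as the Laplace distribution has exponential tails, the noise we add to obtain $\tilde{v}$ is not too large:

Theorem 4. Given the assumptions of Theorem 1, we have the following bound,

$$|\tilde{v} - f(\lambda^*)| \leq \sqrt{2\log(2T/\delta)} + \frac{\Omega}{T} + a\left(\frac{\Omega}{\epsilon T} + \frac{c}{\epsilon} + \frac{q}{\epsilon}\right),$$

with probability at least $1 - (\delta + e^{-a})$ for $\Omega = \sqrt{C_1 T \beta_T \gamma_T}$.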

Proof (Theorem 4). Let $Z$ be a Laplace random variable with scale parameter $b$ and location parameter 0; $Z \sim \mathrm{Lap}(b)$. Then $\Pr[|Z| \leq ab] = 1 - e^{-a}$. Thus, in Algorithm 1, $|\tilde{v} - \max_{t \leq T} v_t| \leq ab$ for $b = \frac{\Omega}{\epsilon T} + \frac{c}{\epsilon} + \frac{q}{\epsilon}$ with probability at least $1 - e^{-a}$. Further observe,

$$ab \geq \max_{t \leq T} v_t - \tilde{v} \geq \left(\max_{t \leq T} f(\lambda_t) - \max_{t \leq T} |\alpha_t|\right) - \tilde{v} \geq f(\lambda^*) - \frac{\Omega}{T} - \max_{t \leq T} |\alpha_t| - \tilde{v}, \qquad (8)$$

where the second and third inequality follow from the proof of Theorem 3 (using the regret bound of Srinivas et al. (2010), Theorem 1). Note that the third inequality holds with probability greater than $1 - \frac{\delta}{2}$ (given $\beta_t$ in Algorithm 1). The final inequality implies $f(\lambda^*) - \tilde{v} \leq \max_{t \leq T} |\alpha_t| + \frac{\Omega}{T} + ab$. Also note that,

$$ab \geq \tilde{v} - \max_{t \leq T} v_t \geq \tilde{v} - \left(\max_{t \leq T} f(\lambda_t) + \max_{t \leq T} |\alpha_t|\right) \geq \tilde{v} - f(\lambda^*) - \frac{\Omega}{T} - \max_{t \leq T} |\alpha_t|. \qquad (9)$$

This implies that $f(\lambda^*) - \tilde{v} \geq -\max_{t \leq T} |\alpha_t| - \frac{\Omega}{T} - ab$. Thus we have that $|\tilde{v} - f(\lambda^*)| \leq \max_{t \leq T} |\alpha_t| + \frac{\Omega}{T} + ab$. Finally, because $|\alpha_t|$ could be arbitrarily large we give a high probability upper bound on $|\alpha_t|$ for all $t$. Recall that for $Z_1, \ldots, Z_n \sim \mathcal{N}(0, 1)$ we have by the tail probability bound and union bound that $\Pr[\forall t, |Z_t| \leq \gamma] \geq 1 - n e^{-\gamma^2/2} = 1 - \frac{\delta}{2}$. Therefore, if we set $Z_t = \alpha_t$ and $n = T$, we obtain $\gamma = \sqrt{2\log(2T/\delta)}$. As defined, $\gamma \geq \max_{t \leq T} |\alpha_t|$. ■
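The Laplace tail identity used at the start of this proof follows from a direct integration. As a brief worked step, assuming the standard density $p(z) = \frac{1}{2b} e^{-|z|/b}$ for $Z \sim \mathrm{Lap}(b)$ and any $a \geq 0$:

```latex
\Pr[|Z| > ab]
  = 2 \int_{ab}^{\infty} \frac{1}{2b}\, e^{-z/b}\, dz
  = \left[ -e^{-z/b} \right]_{ab}^{\infty}
  = e^{-a},
\qquad\text{so}\qquad
\Pr[|Z| \leq ab] = 1 - e^{-a}.
```

The failure probability therefore depends only on the multiplier $a$ and not on the scale $b$, which is why the utility bounds scale linearly in $ab$.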

We note that, because releasing either $\tilde{\lambda}$ or $\tilde{v}$ is $(\epsilon, \delta)$-differentially private, by Corollaries 1 and 2, releasing both private quantities in Algorithm 1 guarantees $(2\epsilon, 2\delta)$-differential privacy for validation dataset $\mathcal{V}$. This is due to the composition properties of $(\epsilon, \delta)$-differential privacy (Dwork et al., 2006a); in fact, stronger composition results can be demonstrated (Dwork & Roth, 2013).

4. Without observation noise 

In hyper-parameter tuning it may be reasonable to assume that we can observe function evaluations exactly: $v_{\mathcal{V},t} = f_{\mathcal{V}}(\lambda_t)$. First note that we can use the same algorithm to report the maximum $\lambda$ in the no-noise setting. Theorems 1 and 2 still hold (note that $q = 0$ in Theorem 2). However, we cannot readily report a private maximum $f$ as the information gain $\gamma_T$ in Theorems 3 and 4 approaches infinity as $\sigma^2 \to 0$. Therefore, we extend results from the previous section to the exact observation case via the regret bounds of de Freitas et al. (2012). Algorithm 2 demonstrates how to privatize the maximum $f$ in the exact observation case.

Algorithm 2 Private Bayesian Opt. (noise free obs.)

Input: $\mathcal{V}$; $\Lambda \subseteq \mathbb{R}^d$; $T$; $(\epsilon, \delta)$; $A, \tau$; assumptions on $f_{\mathcal{V}}$ in de Freitas et al. (2012)
Run method of de Freitas et al. (2012), resulting in noise free observations: $f_{\mathcal{V}}(\lambda_1), \ldots, f_{\mathcal{V}}(\lambda_T)$
$c = 2\sqrt{(1 - k(\mathcal{V}, \mathcal{V}')) \log(2|\Lambda|/\delta)}$
Draw $\theta \sim \mathrm{Lap}\left[\frac{A}{\epsilon} e^{-\frac{2\tau}{(\log 2)^{d/4}}} + \frac{c}{\epsilon}\right]$
Return: $\tilde{f} = \max_{2 \leq t \leq T} f_{\mathcal{V}}(\lambda_t) + \theta$

4.1. Private near-maximum validation gain

We demonstrate that releasing $\tilde{f}$ in Algorithm 2 is private (Theorem 3) and that a small amount of noise is added to make $\tilde{f}$ private (Theorem 6). To do so, we derive the global sensitivity of $\max_{2 \leq t \leq T} f(\lambda_t)$ in Algorithm 2 independent of the maximum information gain $\gamma_T$ via de Freitas et al. (2012). Then we prove releasing $\tilde{f}$ is $(\epsilon, \delta)$-differentially private and that $\tilde{f}$ is almost $\max_{2 \leq t \leq T} f(\lambda_t)$.

Global sensitivity. The following Theorem gives a bound on the global sensitivity of the maximum $f$.

Theorem 5. Given Assumption 1 and the assumptions in Theorem 2 of de Freitas et al. (2012), for neighboring datasets $\mathcal{V}, \mathcal{V}'$ we have the following global sensitivity bound (in the Laplace mechanism sense),

$$\left|\max_{2 \leq t \leq T} f'(\lambda_t) - \max_{2 \leq t \leq T} f(\lambda_t)\right| \leq A e^{-\frac{2\tau}{(\log 2)^{d/4}}} + c$$

w.p. at least $1 - \delta$ for $c = 2\sqrt{(1 - k(\mathcal{V}, \mathcal{V}')) \log(2|\Lambda|/\delta)}$.

We leave the proof to the supplementary material.

Given this sensitivity, we may apply the Laplace mechanism to release $\tilde{f}$.

Corollary 3. Let $\mathcal{A}(\mathcal{V})$ denote Algorithm 2 run on dataset $\mathcal{V}$. Given Assumption 1 and that $f$ satisfies the assumptions of de Freitas et al. (2012), $\tilde{f}$ is $(\epsilon, \delta)$-differentially private, with respect to any neighboring dataset $\mathcal{V}'$, i.e.,

$$\Pr[\mathcal{A}(\mathcal{V}) = \tilde{f}] \leq e^{\epsilon} \Pr[\mathcal{A}(\mathcal{V}') = \tilde{f}] + \delta.$$

Even though we must add noise to the maximum $f$ we show that $\tilde{f}$ is still close to the optimal $f(\lambda^*)$.

Theorem 6. Given the assumptions of Theorem 3, we have the utility guarantee for Algorithm 2:

$$|\tilde{f} - f(\lambda^*)| \leq \Omega + a\left(\frac{\Omega}{\epsilon} + \frac{c}{\epsilon}\right)$$

w.p. at least $1 - (\delta + e^{-a})$ for $\Omega = A e^{-\frac{2\tau}{(\log 2)^{d/4}}}$.

We prove Corollary 3 and Theorem 6 in the supplementary material. We have demonstrated that in the noisy and noise-free settings we can release private near-optimal hyper-parameter settings $\tilde{\lambda}$ and function evaluations $\tilde{v}, \tilde{f}$. However, the analysis thus far assumes the hyper-parameter set is finite: $|\Lambda| < \infty$. It is possible to relax this assumption, using an analysis similar to (Srinivas et al., 2010). We leave this analysis to the supplementary material.

5. Without the GP assumption

Even if our true validation score $f$ is not drawn from a Gaussian process (Assumption 1), we can still guarantee differential privacy for releasing its value after Bayesian optimization $f^{BO} = \max_{t \leq T} f(\lambda_t)$. In this section we describe a different functional assumption on $f$ that also yields differentially private Bayesian optimization for the case of machine learning hyper-parameter tuning.
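Assume we have a (nonsensitive) training set $\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{n}$, which, given a hyperparameter $\lambda$ produces a model $w(\lambda)$ from the following optimization,

$$w_\lambda = \arg\min_{w} \overbrace{\frac{\lambda}{2}\|w\|_2^2 + \frac{1}{n}\sum_{i=1}^{n} \ell(w, x_i, y_i)}^{O_\lambda(w)}. \qquad (10)$$

The function $\ell$ is a training loss function (e.g., logistic loss, hinge loss). Given a (sensitive) validation set $\mathcal{V} = \{(x_i, y_i)\}_{i=1}^{m} \subseteq \mathcal{X}$ we would like to use Bayesian optimization to maximize a validation score $f_{\mathcal{V}}$.

Assumption 2. Our true validation score $f_{\mathcal{V}}$ is

$$f_{\mathcal{V}}(w(\lambda)) = -\frac{1}{m}\sum_{i=1}^{m} g(w_\lambda, x_i, y_i),$$

where $g(\cdot)$ is a validation loss function that is $L$-Lipschitz in $w$ (e.g., ramp loss, normalized sigmoid (Huang et al., 2014)). Additionally, the training model $w_\lambda$ is the minimizer of eq. (10) for a training loss $\ell(\cdot)$ that is 1-Lipschitz in $w$ and convex (e.g., logistic loss, hinge loss).

Algorithm 3 describes a procedure for privately releasing the best validation accuracy given Assumption 2. Different from previous algorithms, we may run Bayesian optimization in Algorithm 3 with any acquisition function (e.g., expected improvement (Mockus et al., 1978), UCB) and privacy is still guaranteed.

Algorithm 3 Private Bayesian Opt. (Lipschitz and convex)

Input: $\mathcal{T}$ size $n$; $\mathcal{V}$ size $m$; $\Lambda$; $\lambda_{\min}$; $\lambda_{\max}$; $\epsilon$; $T$; $L$; $d$
Run Bayesian optimization for $T$ timesteps, observing: $f_{\mathcal{V}}(w_{\lambda_1}), \ldots, f_{\mathcal{V}}(w_{\lambda_T})$ for $\{\lambda_1, \ldots, \lambda_T\} = \Lambda_{T,\mathcal{V}} \subseteq \Lambda$
$f^{BO} = \max_{t \leq T} f_{\mathcal{V}}(w_{\lambda_t})$
$g^* = \max_{(x,y) \in \mathcal{X}, w \in \mathbb{R}^d} g(w, x, y)$
Draw $\theta \sim \mathrm{Lap}\left[\frac{1}{\epsilon}\min\left\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\right\} + \frac{(\lambda_{\max} - \lambda_{\min})L}{\epsilon\lambda_{\max}\lambda_{\min}}\right]$
Return: $\tilde{f}_L = f^{BO} + \theta$

Similar to Algorithms 1 and 2 we use the Laplace mechanism to mask the possible change in validation accuracy when $\mathcal{V}$ is swapped with a neighboring validation set $\mathcal{V}'$. Different from the work of Chaudhuri & Vinterbo (2013), changing $\mathcal{V}$ to $\mathcal{V}'$ may also lead to Bayesian optimization searching different hyper-parameters, $\Lambda_{T,\mathcal{V}}$ vs. $\Lambda_{T,\mathcal{V}'}$. Therefore, we must bound the total global sensitivity of $f$ with respect to $\mathcal{V}$ and $\lambda$.

Definition 5. The total global sensitivity of $f$ over all neighboring datasets $\mathcal{V}, \mathcal{V}'$ is

$$\Delta_f = \max_{\substack{\mathcal{V}, \mathcal{V}' \subseteq \mathcal{X} \\ \lambda, \lambda' \in \Lambda}} |f_{\mathcal{V}}(w_\lambda) - f_{\mathcal{V}'}(w_{\lambda'})|.$$

In the following theorem we demonstrate that we can bound the change in $f$ for arbitrary $\lambda < \lambda'$.

Theorem 7. Given Assumption 2, for neighboring $\mathcal{V}, \mathcal{V}'$ and arbitrary $\lambda < \lambda'$ we have that,

$$|f_{\mathcal{V}}(w_\lambda) - f_{\mathcal{V}'}(w_{\lambda'})| \leq \frac{(\lambda' - \lambda)L}{\lambda\lambda'} + \min\left\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\right\},$$

where $L$ is the Lipschitz constant of $f$, $g^* = \max_{(x,y) \in \mathcal{X}, w \in \mathbb{R}^d} g(w, x, y)$, and $m$ is the size of $\mathcal{V}$.

Proof. Applying the triangle inequality yields

$$|f_{\mathcal{V}}(w_\lambda) - f_{\mathcal{V}'}(w_{\lambda'})| \leq |f_{\mathcal{V}}(w_\lambda) - f_{\mathcal{V}}(w_{\lambda'})| + |f_{\mathcal{V}}(w_{\lambda'}) - f_{\mathcal{V}'}(w_{\lambda'})|.$$

This second term is bounded by Chaudhuri & Vinterbo (2013) in the proof of Theorem 4. The only difference is, as we are not adding random noise to $w_{\lambda'}$ we have that $|f_{\mathcal{V}}(w_{\lambda'}) - f_{\mathcal{V}'}(w_{\lambda'})| \leq \min\{g^*/m, L/(m\lambda_{\min})\}$.

To bound the first term, let $O_\lambda(w)$ be the value of the objective in eq. (10) for a particular $\lambda$. Note that $O_\lambda(w)$ and $O_{\lambda'}(w)$ are $\lambda$- and $\lambda'$-strongly convex. Define

$$h(w) = O_{\lambda'}(w) - O_\lambda(w) = \frac{\lambda' - \lambda}{2}\|w\|_2^2. \qquad (11)$$

Further, define the minimizers $w_\lambda = \arg\min_w O_\lambda(w)$ and $w_{\lambda'} = \arg\min_w [O_\lambda(w) + h(w)]$. This implies that

$$\nabla O_\lambda(w_\lambda) = \nabla O_\lambda(w_{\lambda'}) + \nabla h(w_{\lambda'}) = 0. \qquad (12)$$

Given that $O_\lambda$ is $\lambda$-strongly convex (Shalev-Shwartz, 2007), and by the Cauchy-Schwarz inequality,

$$\lambda\|w_\lambda - w_{\lambda'}\|_2^2 \leq [\nabla O_\lambda(w_\lambda) - \nabla O_\lambda(w_{\lambda'})]^\top [w_\lambda - w_{\lambda'}] \leq \nabla h(w_{\lambda'})^\top [w_\lambda - w_{\lambda'}] \leq \|\nabla h(w_{\lambda'})\|_2 \|w_\lambda - w_{\lambda'}\|_2.$$

Rearranging,

$$\frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 = \frac{1}{\lambda}\left\|\frac{\lambda' - \lambda}{2}\nabla\|w_{\lambda'}\|_2^2\right\|_2 \geq \|w_\lambda - w_{\lambda'}\|_2. \qquad (13)$$

Now as $w_{\lambda'}$ is the minimizer of $O_{\lambda'}$ we have,

$$\nabla\|w_{\lambda'}\|_2^2 = \frac{2}{\lambda'}\left[-\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(w_{\lambda'}, x_i, y_i)\right].$$

Substituting this expression into eq. (13) and noting that we can pull the positive constant term $(\lambda' - \lambda)/2$ out of the norm and drop the negative sign in the norm gives us

$$\frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 = \frac{\lambda' - \lambda}{\lambda\lambda'}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla\ell(w_{\lambda'}, x_i, y_i)\right\|_2 \leq \frac{\lambda' - \lambda}{\lambda\lambda'}.$$

The last inequality follows from the fact that the loss $\ell$ is 1-Lipschitz by Assumption 2 and the triangle inequality. Thus, along with eq. (13), we have

$$\|w_\lambda - w_{\lambda'}\|_2 \leq \frac{1}{\lambda}\|\nabla h(w_{\lambda'})\|_2 \leq \frac{\lambda' - \lambda}{\lambda\lambda'}.$$

Finally, as $f$ is $L$-Lipschitz in $w$,

$$|f_{\mathcal{V}}(w_\lambda) - f_{\mathcal{V}}(w_{\lambda'})| \leq L\|w_\lambda - w_{\lambda'}\|_2 \leq L\frac{\lambda' - \lambda}{\lambda\lambda'}.$$

Combining the result of Chaudhuri & Vinterbo (2013) with the above expression completes the proof. ■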

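Given a finite set of possible hyperparameters $\Lambda$, we would like to bound $|f^*_{\mathcal{V}} - f^*_{\mathcal{V}'}|$; the best validation score found when running Bayesian optimization on $\mathcal{V}$ vs. $\mathcal{V}'$. Note that, by Theorem 7,

$$|f^*_{\mathcal{V}} - f^*_{\mathcal{V}'}| \leq \max_{\lambda, \lambda'} |f_{\mathcal{V}}(w_\lambda) - f_{\mathcal{V}'}(w_{\lambda'})| \leq \frac{(\lambda_{\max} - \lambda_{\min})L}{\lambda_{\max}\lambda_{\min}} + \min\left\{\frac{g^*}{m}, \frac{L}{m\lambda_{\min}}\right\},$$

as $(\lambda' - \lambda)/(\lambda'\lambda)$ is strictly increasing in $\lambda'$ and strictly decreasing in $\lambda$. Given this sensitivity of $f^*$ we can use the Laplace mechanism to hide changes in the validation set as follows.

Corollary 4. Let $\mathcal{A}(\mathcal{V})$ denote Algorithm 3 applied on dataset $\mathcal{V}$. Given Assumption 2, $\tilde{f}_L$ is $\epsilon$-differentially private, i.e., $\Pr[\mathcal{A}(\mathcal{V}) = \tilde{f}_L] \leq e^{\epsilon} \Pr[\mathcal{A}(\mathcal{V}') = \tilde{f}_L]$.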
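For a sense of scale, the Laplace noise parameter implied by the sensitivity bound above can be evaluated numerically. The sketch below assumes the scale is the bound divided by $\epsilon$; the function and argument names are hypothetical, and `g_star` and `L` are taken as known constants of the validation loss.

```python
def algorithm3_laplace_scale(g_star, L, m, lam_min, lam_max, eps):
    """Laplace scale: (min{g*/m, L/(m lam_min)} + (lam_max - lam_min) L / (lam_max lam_min)) / eps."""
    # Effect of swapping one validation record.
    validation_term = min(g_star / m, L / (m * lam_min))
    # Effect of Bayesian optimization visiting different hyper-parameters.
    hyper_term = (lam_max - lam_min) * L / (lam_max * lam_min)
    return (validation_term + hyper_term) / eps
```

For example, with `g_star = 1`, `L = 1`, `m = 100`, `lam_min = 0.1`, `lam_max = 1.0`, and `eps = 1`, the validation term is 0.01 while the hyper-parameter term is about 9, so under these assumed values the spread of the regularization range dominates the noise. The validation term shrinks as `m` grows, but the hyper-parameter term does not.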


$\lambda_1, \ldots, \lambda_{100}$ from a Sobol sequence and
train an SVM model for each on the Forest UCI dataset (36,603
training inputs). We then randomly sample 100 i.i.d. validation
sets $V$. Here we describe the evaluation procedure for a fixed
validation set size, which corresponds to a single curve in
Figure 1 (to generate all results, we repeat this procedure for
each validation set size in {1000, 2000, 3000, 5000, 15000}).
For each of the 100 validation sets we randomly add or remove
an input to form a neighboring dataset $V'$. We then evaluate
each of the trained SVM models on all 100 datasets $V$ and
their pairs $V'$. This results in two 100 × 100 (number of
datasets × number of trained SVM models) function evaluation
matrices $F_V$ and $F_{V'}$. Thus, $[F_V]_{ij}$ is the
validation accuracy on the $i$th validation set $V_i$ using the
$j$th SVM model.
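
The construction of the two evaluation matrices can be sketched as follows. This is an illustrative Python sketch only: it replaces the trained SVMs with hypothetical one-dimensional threshold classifiers, uses a synthetic labelled pool instead of the Forest dataset, and shrinks the sizes to 10 datasets and 10 models. All function names and data here are assumptions, not the authors' code.

```python
import random


def accuracy(threshold, dataset):
    """Accuracy of the stand-in classifier 'predict +1 iff x > threshold'."""
    correct = sum(1 for x, y in dataset if (1 if x > threshold else -1) == y)
    return correct / len(dataset)


def neighbor(dataset, pool, rng):
    """Form a neighboring dataset by randomly adding or removing one input."""
    if rng.random() < 0.5:
        return dataset + [rng.choice(pool)]
    drop = rng.randrange(len(dataset))
    return dataset[:drop] + dataset[drop + 1:]


def evaluation_matrices(models, datasets, pool, rng):
    """Return (F_V, F_Vp); row i holds each model's accuracy on V_i (resp. V_i')."""
    F_V, F_Vp = [], []
    for V in datasets:
        V_prime = neighbor(V, pool, rng)
        F_V.append([accuracy(t, V) for t in models])
        F_Vp.append([accuracy(t, V_prime) for t in models])
    return F_V, F_Vp


rng = random.Random(0)
# Synthetic pool (an assumption): label is the sign of x, flipped with probability 0.1.
pool = []
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0)
    y = 1 if x > 0 else -1
    pool.append((x, y if rng.random() > 0.1 else -y))

m = 200                                        # validation set size
models = [-0.9 + 0.2 * j for j in range(10)]   # 10 stand-in "trained models"
datasets = [rng.sample(pool, m) for _ in range(10)]
F_V, F_Vp = evaluation_matrices(models, datasets, pool, rng)
```

Because $V_i$ and $V_i'$ differ in a single input, corresponding entries of the two matrices differ by at most $1/(m-1)$ when the score is plain accuracy.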

The likelihood of function evaluations for a dataset pair
$(V_i, V_i')$, for a value of $k_1(V_i, V_i')$, is given by the
marginal likelihood of the multi-task Gaussian process:

$$\Pr\left([F_V]_i, [F_{V'}]_i\right) = \mathcal{N}(\mu_{100}, \sigma^2_{100}), \qquad (14)$$


We leave the proof to the supplementary material. Further,
by the exponential tails of the Laplace mechanism, we have
the following utility guarantee.

Theorem 8. Given the assumptions of Theorem 7, we have the
following utility guarantee for $\tilde{f}_L$ w.r.t. $f^*$:

$$|\tilde{f}_L - f^*| \leq a \left[ \frac{1}{\epsilon m} \min\left\{ g^*, \frac{L}{\lambda_{\min}} \right\} + \frac{(\lambda_{\max} - \lambda_{\min}) L}{\epsilon \lambda_{\max} \lambda_{\min}} \right]$$

with probability at least $1 - e^{-a}$.

Proof. This follows exactly from the tail bound on Laplace
random variables, given in the beginning of the proof of
Theorem 6. ■
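
The release step and its utility bound can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the authors' Algorithm 3: the sensitivity is taken to be $(\lambda_{\max} - \lambda_{\min})L/(\lambda_{\max}\lambda_{\min}) + \min\{g^*/m, L/(m\lambda_{\min})\}$, the function names are hypothetical, and the numeric inputs are arbitrary. Laplace noise is drawn as the difference of two independent exponential variables, which has a zero-mean Laplace distribution.

```python
import math
import random


def total_sensitivity(lam_min, lam_max, L, g_star, m):
    """Assumed sensitivity bound over hyper-parameters in [lam_min, lam_max]."""
    hyper_term = (lam_max - lam_min) * L / (lam_max * lam_min)
    data_term = min(g_star / m, L / (m * lam_min))
    return hyper_term + data_term


def sample_laplace(scale, rng):
    """Zero-mean Laplace(scale) draw: difference of two unit-rate exponentials, rescaled."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_release(best_score, sensitivity, epsilon, rng):
    """Laplace mechanism: add Laplace(sensitivity / epsilon) noise to the best score."""
    return best_score + sample_laplace(sensitivity / epsilon, rng)


rng = random.Random(0)
epsilon, a = 1.0, 2.0
# Arbitrary illustrative inputs: L = g* = 1, m = 1000 validation points.
delta = total_sensitivity(lam_min=0.5, lam_max=1.0, L=1.0, g_star=1.0, m=1000)
released = private_release(best_score=-0.12, sensitivity=delta, epsilon=epsilon, rng=rng)
# The released value is within a * delta / epsilon of the true best score
# with probability at least 1 - exp(-a).
utility_bound = a * delta / epsilon
```

Under this form of the bound, the hyper-parameter term does not shrink with the validation set size $m$, so the noise scale is governed mainly by how wide the range $[\lambda_{\min}, \lambda_{\max}]$ is.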


6. Results 

In this section we examine the validity of our multi-task
Gaussian process assumption on
$[f_1, \ldots, f_{2^{|\mathcal{X}|}}]$. Specifically, we search
for the most likely value of the multi-task Gaussian process
covariance element $k_1(V, V')$ for classifier hyper-parameter
tuning. Larger values of $k_1(V, V')$ correspond to smaller
global sensitivity bounds in Theorems 1, 3, and 5, leading to
improved privacy guarantees.

For our setting of hyper-parameter tuning, each
$\lambda = [C, \gamma^2]$ is a pair of hyper-parameters for
training a kernelized support vector machine (SVM) (Cortes &
Vapnik, 1995; Scholkopf & Smola, 2001) with cost parameter $C$
and radial basis kernel width $\gamma^2$. The value
$f_V(\lambda)$ is the accuracy on $V$ of the SVM model trained
with hyper-parameters $\lambda$.

To search for the most likely $k_1(V, V')$ we start by
sampling 100 different SVM hyper-parameter settings


where $[F_V]_i = [f_V(\lambda_1), \ldots, f_V(\lambda_{100})]$
(similarly for $V'$), and $\mu_{100}$ and $\sigma^2_{100}$ are
the posterior mean and variance of the multi-task Gaussian
process using kernel $k_1(V, V') \otimes k_2(\lambda, \lambda')$
after observing $[F_V]_i$ and $[F_{V'}]_i$ (for more details
see Bonilla et al. (2008)). As $\mu_{100}$ and $\sigma^2_{100}$
depend on $k_1(V, V')$, we treat it as a free parameter and
vary its value from 0.05 to 0.95 in increments of 0.05. For
each value, we compute the marginal likelihood (14) for all
validation datasets ($V_i$ for $i = 1, \ldots, 100$). As each
$V_i$ is sampled i.i.d., the joint marginal likelihood is
simply the product of all $V_i$ likelihoods. Computing this
joint marginal likelihood for each $k_1(V, V')$ value yields a
single curve of Figure 1. As shown, the largest value,
$k_1(V, V') = 0.95$, is the most likely, meaning that $c$ in the
global sensitivity bounds is quite small, leading to private
values that are closer to their true optima.
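
The grid search over $k_1(V, V')$ can be sketched in Python as below. This is a simplified illustration and not the authors' implementation: it evaluates a zero-mean joint Gaussian log-likelihood over the stacked evaluations rather than the posterior quantities $\mu_{100}$ and $\sigma^2_{100}$, it assumes a squared-exponential $k_2$ over a one-dimensional hyper-parameter, and it adds a small diagonal jitter for numerical stability. The function names and the toy accuracy values are hypothetical.

```python
import math


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    Lm = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(Lm[i][k] * Lm[j][k] for k in range(j))
            Lm[i][j] = math.sqrt(s) if i == j else s / Lm[j][j]
    return Lm


def log_marginal_likelihood(y, K):
    """log N(y | 0, K), computed through the Cholesky factor of K."""
    Lm = cholesky(K)
    # Forward substitution: solve Lm z = y, so that y^T K^{-1} y = z^T z.
    z = []
    for i in range(len(y)):
        z.append((y[i] - sum(Lm[i][k] * z[k] for k in range(i))) / Lm[i][i])
    log_det = 2.0 * sum(math.log(Lm[i][i]) for i in range(len(y)))
    return -0.5 * (sum(v * v for v in z) + log_det + len(y) * math.log(2.0 * math.pi))


def pair_log_likelihood(f_V, f_Vp, lambdas, k1, length_scale=1.0, jitter=1e-6):
    """Log-likelihood of one dataset pair under k1(V, V') * k2(lambda, lambda')."""
    n = len(lambdas)
    k2 = [[math.exp(-0.5 * ((a - b) / length_scale) ** 2) for b in lambdas]
          for a in lambdas]
    task = [[1.0, k1], [k1, 1.0]]  # dataset kernel over the pair {V, V'}
    K = [[task[p // n][q // n] * k2[p % n][q % n] + (jitter if p == q else 0.0)
          for q in range(2 * n)] for p in range(2 * n)]
    return log_marginal_likelihood(list(f_V) + list(f_Vp), K)


def most_likely_k1(F_V, F_Vp, lambdas):
    """Sum pair log-likelihoods over datasets for k1 in {0.05, ..., 0.95}."""
    grid = [0.05 * s for s in range(1, 20)]
    joint = [sum(pair_log_likelihood(a, b, lambdas, k1) for a, b in zip(F_V, F_Vp))
             for k1 in grid]
    return grid[joint.index(max(joint))], joint


# Toy values for illustration only (not results from the paper).
lambdas = [0.0, 1.0, 2.0, 3.0, 4.0]
F_V = [[0.60, 0.70, 0.80, 0.75, 0.65]]
F_Vp = [[0.60, 0.71, 0.80, 0.75, 0.64]]
best_k1, joint_ll = most_likely_k1(F_V, F_Vp, lambdas)
```

Summing log-likelihoods over dataset pairs corresponds to taking the product of the per-dataset likelihoods, which is how the joint marginal likelihood is formed for i.i.d. validation sets.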

7. Related work 

There has been much work towards differentially private
convex optimization (Chaudhuri et al., 2011; Kifer et al.,
2012; Duchi et al., 2013; Song et al., 2013; Jain & Thakurta,
2014; Bassily et al., 2014). The work of Bassily et al. (2014)
established upper and lower bounds for the excess empirical
risk of $\epsilon$- and $(\epsilon, \delta)$-differentially
private algorithms for many settings, including convex and
strongly convex risk functions that may or may not be smooth.
There is also related work towards private high-dimensional
regression, where the dimensions outnumber the number of
instances (Kifer et al., 2012; Smith & Thakurta, 2013a). In
such cases the Hessian becomes singular and so the loss is not
strongly convex. However, it is possible to use the restricted
strong convexity of the loss in the regression case to
guarantee privacy.

Figure 1. The log likelihood of a multi-task Gaussian process
for different values of the dataset kernel value $k_1(V, V')$.
The function evaluations are the validation accuracy of SVMs
with different hyper-parameters.

Differential privacy has been shown to be achievable in
online and interactive kernel learning settings (Jain et al.,
2012; Smith & Thakurta, 2013b; Jain & Thakurta, 2013; Mishra &
Thakurta, 2014). In general, non-private online algorithms are
closest in spirit to the methods of Bayesian optimization.
However, all of the previous work in differentially private
online learning represents a dataset as a sequence of bandit
arm pulls (the equivalent notion in Bayesian optimization is
function evaluations $f(x_t)$). Instead, we consider functions
in which changing a single dataset entry possibly affects all
future function evaluations. Closest to our work is that of
Chaudhuri & Vinterbo (2013), who show that, given a fixed set
of hyper-parameters which are always evaluated for any
validation set, they can return a private version of the index
of the best hyper-parameter, as well as a private model
trained with that hyper-parameter. Our setting is strictly
more general in that, if the validation set changes, Bayesian
optimization could search completely different
hyper-parameters.

Bayesian optimization, largely due to its principled handling
of the exploration/exploitation trade-off of global, black-box
function optimization, is quickly becoming the global
optimization paradigm of choice. Alongside promising empirical
results there is a wealth of recent work on convergence
guarantees for Bayesian optimization, similar to those used in
this work (Srinivas et al., 2010; de Freitas et al., 2012).
Vazquez & Bect (2010) and Bull (2011) give regret bounds for
optimizing the expected improvement acquisition function at
each optimization step. BayesGap (Hoffman et al., 2014) gives
a convergence guarantee for Bayesian optimization with budget
constraints. Bayesian optimization has also been extended to
multi-task optimization (Bardenet et al., 2013; Swersky et
al., 2013), to the setting where multiple experiments can be
run at once (Azimi et al., 2012; Snoek et al., 2012), and to
constrained optimization (Gardner et al., 2014).

8. Conclusion 

We have introduced methods for privately releasing the best
hyper-parameters and validation accuracies in the case of
exact and noisy observations. Our work makes use of the
differential privacy framework, which has become commonplace
in private machine learning (Dwork & Roth, 2013). We believe
we are the first to demonstrate differentially private
quantities in the setting of global optimization of expensive
(possibly nonconvex) functions, through the lens of Bayesian
optimization.

One key future direction is to design techniques to release
each sampled hyper-parameter and validation accuracy privately
(during the run of Bayesian optimization). This requires
analyzing how the maximum upper-confidence bound changes as
the validation dataset changes. Another interesting direction
is extending our guarantees in Sections 3 and 4 to other
acquisition functions.

For the case of machine learning hyper-parameter tuning, our
results are designed to guarantee privacy of the validation
set only (equivalently, the training set is assumed never to
change). To simultaneously protect the privacy of the training
set, it may be possible to use techniques similar to the
training stability results of Chaudhuri & Vinterbo (2013).
Training stability could be guaranteed, for example, by
assuming an additional training set kernel that bounds the
effect of altering the training set on $f$. We leave
developing these guarantees for future work.

As practitioners begin to use Bayesian optimization in
practical settings involving sensitive data, it becomes
crucial to consider how to preserve data privacy while
reporting accurate Bayesian optimization results. This work
presents methods to achieve such privacy, which we hope will
be useful to practitioners and theorists alike.

References 

Auer, Peter, Cesa-Bianchi, Nicolo, and Fischer, Paul. 
Finite-time analysis of the multiarmed bandit problem. 
Machine learning, 47(2-3):235-256, 2002. 

Azimi, Javad, Jalali, Ali, and Fern, Xiaoli Z. Hybrid batch 
bayesian optimization. In ICML 2012, pp. 1215-1222. 
ACM, 2012. 

Bardenet, Remi, Brendel, Matyas, Kegl, Balazs, and Sebag,
Michele. Collaborative hyperparameter tuning. In
ICML, 2013.

Bassily, Raef, Smith, Adam, and Thakurta, Abhradeep.
Private empirical risk minimization, revisited. arXiv
preprint arXiv:1405.7085, 2014.

Bergstra, James and Bengio, Yoshua. Random search
for hyper-parameter optimization. JMLR, 13:281-305,
2012.

Bonilla, Edwin, Chai, Kian Ming, and Williams,
Christopher. Multi-task gaussian process prediction. In
NIPS, 2008.

Bull, Adam D. Convergence rates of efficient global
optimization algorithms. JMLR, 12:2879-2904, 2011.

Chaudhuri, Kamalika and Vinterbo, Staal A. A
stability-based validation procedure for differentially
private machine learning. In Advances in Neural Information
Processing Systems, pp. 2652-2660, 2013.

Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate,
Anand D. Differentially private empirical risk
minimization. JMLR, 12:1069-1109, 2011.

Chong, Miao M, Abraham, Ajith, and Paprzycki,
Marcin. Traffic accident analysis using machine learning
paradigms. Informatica (Slovenia), 29(1):89-98, 2005.

Cortes, Corinna and Vapnik, Vladimir. Support-vector
networks. Machine learning, 20(3):273-297, 1995.

de Freitas, Nando, Smola, Alex, and Zoghi, Masrour.
Exponential regret bounds for gaussian process bandits
with deterministic observations. In ICML, 2012.

Dinur, Irit and Nissim, Kobbi. Revealing information while
preserving privacy. In Proceedings of the
SIGMOD-SIGACT-SIGART symposium on principles of database
systems, pp. 202-210. ACM, 2003.

Duchi, John C, Jordan, Michael I, and Wainwright,
Martin J. Local privacy and statistical minimax rates. In
FOCS, pp. 429-438. IEEE, 2013.

Dwork, Cynthia and Roth, Aaron. The algorithmic
foundations of differential privacy. Theoretical Computer
Science, 9(3-4):211-407, 2013.

Dwork, Cynthia, Kenthapadi, Krishnaram, McSherry,
Frank, Mironov, Ilya, and Naor, Moni. Our data,
ourselves: Privacy via distributed noise generation. In
Advances in Cryptology-EUROCRYPT 2006, pp. 486-503.
Springer, 2006a.

Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and 
Smith, Adam. Calibrating noise to sensitivity in private 
data analysis. In Theory of Cryptography, pp. 265-284. 
Springer, 2006b. 


Ganta, Srivatsava Ranjit, Kasiviswanathan, Shiva Prasad,
and Smith, Adam. Composition attacks and auxiliary
information in data privacy. In KDD, pp. 265-273. ACM,
2008.

Gardner, Jacob, Kusner, Matt, Xu, Zhixiang, Weinberger,
Kilian, and Cunningham, John. Bayesian optimization
with inequality constraints. In ICML, pp. 937-945, 2014.

Hoffman, Matthew, Shahriari, Bobak, and de Freitas,
Nando. On correlation and budget constraints in
model-based bandit optimization with application to
automatic machine learning. In AISTATS, pp. 365-374, 2014.

Huang, Xiaolin, Shi, Lei, and Suykens, Johan AK. Ramp
loss linear programming support vector machine. The
Journal of Machine Learning Research, 15(1):2185-2211,
2014.

Hutter, Frank, Hoos, H. Holger, and Leyton-Brown, Kevin.
Sequential model-based optimization for general
algorithm configuration. In Learning and Intelligent
Optimization, pp. 507-523. Springer, 2011.

Jain, Prateek and Thakurta, Abhradeep. Differentially
private learning with kernels. In ICML, pp. 118-126, 2013.

Jain, Prateek and Thakurta, Abhradeep Guha. (near)
dimension independent risk bounds for differentially
private learning. In ICML, pp. 476-484, 2014.

Jain, Prateek, Kothari, Pravesh, and Thakurta, Abhradeep.
Differentially private online learning. COLT, 2012.

Kifer, Daniel, Smith, Adam, and Thakurta, Abhradeep.
Private convex empirical risk minimization and
high-dimensional regression. JMLR, 1:41, 2012.

Krause, Andreas, Singh, Ajit, and Guestrin, Carlos.
Near-optimal sensor placements in gaussian processes:
Theory, efficient algorithms and empirical studies. JMLR, 9:
235-284, 2008.

McSherry, Frank and Talwar, Kunal. Mechanism design via
differential privacy. In FOCS, pp. 94-103. IEEE, 2007.

Mishra, Nikita and Thakurta, Abhradeep. Private stochastic
multi-arm bandits: From theory to practice. In ICML
Workshop on Learning, Security, and Privacy, 2014.

Mockus, Jonas, Tiesis, Vytautas, and Zilinskas, Antanas.
The application of bayesian methods for seeking the
extremum. Towards Global Optimization, 2(117-129):2,
1978.

Narayanan, Arvind and Shmatikov, Vitaly. Robust
de-anonymization of large sparse datasets. In IEEE
Symposium on Security and Privacy, pp. 111-125. IEEE, 2008.






Rasmussen, Carl Edward and Williams, Christopher K. I. 
Gaussian processes for machine learning. 2006. 

Scholkopf, Bernhard and Smola, Alexander J. Learning 
with kernels: support vector machines, regularization, 
optimization, and beyond. MIT press, 2001. 

Shalev-Shwartz, Shai. Online learning: Theory,
algorithms, and applications. 2007.

Smith, Adam and Thakurta, Abhradeep Guha.
Differentially private feature selection via stability
arguments, and the robustness of the lasso. In COLT,
pp. 819-850, 2013a.

Smith, Adam and Thakurta, Abhradeep Guha. (nearly)
optimal algorithms for private online learning in
full-information and bandit settings. In NIPS,
pp. 2733-2741, 2013b.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan P.
Practical bayesian optimization of machine learning
algorithms. In NIPS, pp. 2951-2959, 2012.

Song, Shuang, Chaudhuri, Kamalika, and Sarwate,
Anand D. Stochastic gradient descent with differentially
private updates. In IEEE Global Conference on Signal
and Information Processing, 2013.

Srinivas, Niranjan, Krause, Andreas, Kakade, Sham M, and
Seeger, Matthias. Gaussian process optimization in the
bandit setting: No regret and experimental design. In
ICML, 2010.

Sweeney, Latanya. Weaving technology and policy
together to maintain confidentiality. The Journal of Law,
Medicine & Ethics, 25(2-3):98-110, 1997.

Swersky, Kevin, Snoek, Jasper, and Adams, Ryan P.
Multi-task bayesian optimization. In NIPS, pp. 2004-2012,
2013.

Vazquez, Emmanuel and Bect, Julien. Convergence
properties of the expected improvement algorithm with fixed
mean and covariance functions. Journal of Statistical
Planning and Inference, 140(11):3088-3095, 2010.

Weinberger, Kilian, Dasgupta, Anirban, Langford, John,
Smola, Alex, and Attenberg, Josh. Feature hashing for
large scale multitask learning. In ICML, pp. 1113-1120.
ACM, 2009.

Yu, Shipeng, Esbroeck, Alexander van, Farooq, Faisal,
Fung, Glenn, Anand, Vikram, and Krishnapuram, Balaji.
Predicting readmission risk with institution specific
prediction models. In IEEE International Conference
on Healthcare Informatics (ICHI), pp. 415-420. IEEE,
2013.




